In this Machine Learning (ML) project I delve into the world of Community Notes (CN) on X, formerly known as Twitter. I will use ML tools to find the best model to predict the number of ratings that each note has received.
Community Notes, found on platforms like X (formerly Twitter), play a crucial role in combating misinformation and improving content moderation. They allow users to add context to posts, providing diverse viewpoints to counter potential biases. The algorithm guiding Community Notes emphasizes consensus, ensuring that it’s not just about majority agreement. This collaborative approach empowers users to contribute to a more informed online space and brings transparency to the fact-checking process. The significance of Community Notes lies in their ability to debunk misinformation, lessen the impact of misleading content, and encourage a collective effort toward promoting accuracy in digital conversations.
Community Notes stands out as an innovative platform where contributors collaboratively add context to potentially misleading posts, challenging conventional content moderation methods. The publication of a CN is driven not by a majority rule but by the agreement of contributors who have previously disagreed, creating a transparent, community-driven approach to combat misinformation.
This sounds like a great idea, but Community Notes have been shown to be sometimes useful yet insufficient[2] or irrelevant[3], and even susceptible to disinformation[4], as you can see in more detail in the Wikipedia page dedicated to Community Notes (CN).
At the core of this exploration is the open-source algorithm powering CN, described as “insanely complicated.” This algorithm ensures that notes are rated by a diverse range of perspectives, incorporating an opinion classification based on contributors’ alignment with the left- and right-wing political spectrum. A note is posted only after people who previously disagreed agree on its helpfulness. Therefore, the number of ratings that a note receives is a key determinant of whether the note is ever published, and how fast.
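The bridging idea can be illustrated with a toy function: a note only qualifies as helpful when raters from both simulated viewpoint groups find it helpful. This is purely an illustration with made-up thresholds and group labels; the real Community Notes algorithm uses matrix factorization over rater embeddings, not a per-group cutoff.

```r
# Toy sketch of "bridging" agreement (NOT the actual CN algorithm):
# a note counts as helpful only if the helpful-rate within *each*
# simulated viewpoint group clears a threshold.
bridging_helpful <- function(helpful, group, threshold = 0.5) {
  # helpful: 0/1 vector of ratings; group: a viewpoint label per rater
  rates <- tapply(helpful, group, mean)
  length(rates) > 1 && all(rates >= threshold)
}

# A note rated helpful mostly by one group does not qualify...
bridging_helpful(c(1, 1, 1, 0, 0), c("L", "L", "L", "R", "R"))  # -> FALSE
# ...but agreement across both groups does.
bridging_helpful(c(1, 1, 1, 1, 0), c("L", "L", "R", "R", "R"))  # -> TRUE
```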
The project is centered around a vast data set comprising around 380,000 notes, each representing a collaborative effort to combat misinformation. Of particular significance is the attempt to predict the number of ratings received by each note, as this is a crucial determinant in deciding whether a note is published. This predictive aspect adds a layer of complexity to our analysis, aiming to uncover insights into the collaborative evaluation system and its impact on the publication of notes.
These notes can relate to any topic, even advertising. It is worth noting that the most-rated note was about a game.
The openness of the data invites scrutiny and analysis, fostering an environment where skepticism can be transformed into informed inquiry. Join me on this journey as we explore the intricacies of Community Notes.
The data from the notes and the ratings are open to anyone with an account on X. On the GitHub page of CN you can also find the code and the algorithm. Here are the sources:
Data from the project from X, Community Notes, can be found here.
And the code from Community Notes is in Github.
Since the data sets are very large, I saved a final data set with the information I needed from each one. In this section I describe the original data sets and the code I used to create and merge the raw data into the final data set. If you want to replicate the code, just create a folder named data in your working directory and download all the data directly from X. This section can be skipped; you can continue to the EDA, where I work with the final merged data set.
The raw data can be downloaded directly from X, and they provide a good description of all the variables in each data set. The data are updated continuously; for this project, all the data were downloaded on December 3rd, 2023. I will provide a codebook of the final data set I created from the raw data as documentation.
You have to download the file: notes-00000.tsv
# packages
library(pacman)
p_load(tidyverse,
       lubridate,
       naniar,
       janitor,
       forcats)

# clear the workspace
rm(list = ls())
# load data ----
# all the data was downloaded on December 3rd 2023
# notes
notes <- read_tsv("data/notes-00000.tsv") %>% clean_names()
## Select the variables that will be used in the model from the notes dataset ----
notes_final <-
  notes %>% select(
    note_id,
    tweet_id,
    classification,
    trustworthy_sources,
    summary,
    is_media_note,
    created_at_millis
  ) %>%
  # created_at_millis is in milliseconds, so divide by 1000 before converting
  mutate(created_at = as.POSIXct(created_at_millis / 1000, origin = "1970-01-01")) %>%
  mutate(w_day = wday(created_at, label = TRUE),
         hour = as_factor(hour(created_at)),
         note_length = nchar(summary)) %>%
  select(-c(created_at_millis,
            summary))
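Because created_at_millis stores milliseconds since the Unix epoch, the conversion is easy to get wrong by a factor of 1,000. A quick self-contained sanity check (the example timestamp is an arbitrary illustrative value):

```r
# 1701561600000 ms after the epoch is midnight on December 3rd 2023 (UTC),
# which matches the download date of the data.
ms <- 1701561600000
stamp <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
format(stamp, "%Y-%m-%d")  # -> "2023-12-03"
```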
You have to download the file: noteStatusHistory-00000.tsv
## status
status <- read_tsv("data/noteStatusHistory-00000.tsv") %>% clean_names()
# the note ids in this data set are almost all unique
length(unique(status$note_id))
# however, the few duplicated notes are all rated "NEEDS MORE RATINGS"
duplicated_notes_status <-
  status %>% group_by(note_id) %>%
  summarise(n_notes = n()) %>%
  filter(n_notes > 1) %>%
  pull(note_id) %>%
  format(scientific = FALSE)
# inspect the duplicated rows interactively
status %>% filter(note_id %in% duplicated_notes_status) %>% View()
You have to download the files: ratings-00000.tsv, ratings-00001.tsv, ratings-00002.tsv, ratings-00003.tsv.
## ratings
r0 <- read_tsv("data/ratings-00000.tsv") %>% clean_names()
r1 <- read_tsv("data/ratings-00001.tsv") %>% clean_names()
r2 <- read_tsv("data/ratings-00002.tsv") %>% clean_names()
r3 <- read_tsv("data/ratings-00003.tsv") %>% clean_names()
## select variables from ratings at the notes level ----
rates_summarise <-
  bind_rows(r0,
            r1,
            r2,
            r3) %>%
  group_by(note_id) %>%
  summarise(
    ratings = n(),
    agreement_rate = sum(agree, na.rm = TRUE) / n(),
    helpful_rate = sum(helpfulness_level == "HELPFUL", na.rm = TRUE) / n(),
    not_helpful_rate = sum(helpfulness_level == "NOT_HELPFUL", na.rm = TRUE) / n(),
    somewhat_helpful_rate = sum(helpfulness_level == "SOMEWHAT_HELPFUL", na.rm = TRUE) / n()
  )
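The per-note rates computed above are simply shares of each helpfulness level among a note's ratings. A base-R equivalent for a single hypothetical note (toy values, not project data):

```r
# ratings received by one hypothetical note
levels_seen <- c("HELPFUL", "HELPFUL", "NOT_HELPFUL", "SOMEWHAT_HELPFUL")
# share of ratings marking the note helpful
helpful_rate <- sum(levels_seen == "HELPFUL") / length(levels_seen)
helpful_rate  # -> 0.5
```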
The previous data sets will be merged in a file named
notes_merged.RData. This data set contains all the
variables used in the analysis, and their description can be found in
the codebook.
# merge data ----
notes_merged <- left_join(notes_final, rates_summarise, by = join_by(note_id)) %>%
  # some notes never received ratings; replace their NAs with 0
  replace_na(list(ratings = 0,
                  agreement_rate = 0,
                  helpful_rate = 0,
                  not_helpful_rate = 0,
                  somewhat_helpful_rate = 0))

notes_merged <-
  left_join(x = notes_merged,
            y = status %>%
              # keep only the non-duplicated rows
              filter(!(note_id %in% duplicated_notes_status)) %>%
              # I only analyze the current status
              select(note_id, current_status),
            by = join_by(note_id))

save(notes_merged, file = "data/notes_merged.RData")
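The left-join-then-replace-NA pattern used above can be illustrated on toy data with base R alone (toy ids and counts, not the project files):

```r
notes_toy  <- data.frame(note_id = c(1, 2, 3))
rates_toy  <- data.frame(note_id = c(1, 3), ratings = c(5, 2))
# all.x = TRUE mirrors left_join(): every note is kept
merged_toy <- merge(notes_toy, rates_toy, by = "note_id", all.x = TRUE)
# notes that never received ratings come back as NA; set them to 0
merged_toy$ratings[is.na(merged_toy$ratings)] <- 0
merged_toy$ratings  # -> 5 0 2
```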
# Load Packages
library(pacman)
p_load(tidyverse,
tidymodels,
recipes,
kknn,
yardstick,
tune,
ggplot2,
ggthemes,
rsample,
parsnip,
workflows,
corrplot
)
load("../data/notes_merged.RData")
In the data set there are only 3 rows with missing data. Given the magnitude of the data set, removing them won’t affect the analysis. The tweet IDs of the notes with missing data are:
# missing data ----
# there are 3 rows with missing values
missing_cell <- which(is.na(notes_merged), arr.ind = TRUE)
# These are the tweet ids
notes_merged[missing_cell[,1],] %>% select(tweet_id) %>% pull() %>% format(scientific = F) %>% unique()
## [1] "1370126844251435008" "1381787182394904576" "1370110240532930560"
## [4] "1665427964278763520"
# There is no summary in these notes; this was probably a mistake
# Note 1370110240532930560 has no content but had 8 ratings; the other two notes were never rated
# I remove the missing values; given their content and small number, this shouldn't be an issue
notes_merged <- notes_merged %>% drop_na()
Let’s first check the summary statistics of the number of ratings in each note:
# Analyzing the outcome variable ----
summary(notes_merged$ratings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 15.00 42.85 46.00 6981.00
There is a lot of variability, and it is clear that some notes receive a lot of attention: the median is far less than the mean.
Let’s check this by looking at histograms. I separate the notes into the lowest 99% and the highest 1% by number of ratings. We can see that many notes are never rated, but many notes receive some level of attention. Among the most popular ones, it is clear that a single note creates a lot of distortion.
# 99% of the notes have less than 453 ratings
q_99 <- quantile(notes_merged$ratings,probs = 0.99)
notes_merged %>%
filter(ratings < q_99) %>%
ggplot() +
geom_histogram(aes(x= ratings)) +
labs(
title = "Histogram of the number of Ratings on each Note",
subtitle = "Percentile 99 of the Ratings",
x="Number of Ratings"
) +
theme_bw()
notes_merged %>%
filter(ratings >= q_99) %>%
ggplot() +
geom_histogram(aes(x= ratings)) +
labs(
title = "Histogram of the number of Ratings on each Note",
subtitle = "1% of Notes with more Ratings",
x="Number of Ratings"
) +
theme_bw()
The note with the most ratings is about advertising:
# The note with the most ratings
notes_merged %>% filter(ratings > 6000) %>%
  arrange(desc(ratings)) %>%
  select(tweet_id) %>%
  pull() %>%
  format(scientific = FALSE)
## [1] "1672908357081124864"
# Correlations ----
# Correlation matrix
notes_cor <- cor(notes_merged %>%
select(-ends_with("id")) %>%
select_if(is.numeric))
# Visualization of correlation matrix
notes_corrplot <- corrplot.mixed(notes_cor,
lower = 'shade', upper = 'pie', order = 'hclust',
addCoef.col = 1, number.cex = 0.7,
tl.pos = "lt"
)
It seems that more ratings are associated with notes rated as helpful, which is expected given how the algorithm is described. Notes classified as not helpful also receive more ratings. This holds whether we look at the mean, the median, or the 10th and 90th percentiles.
rmarkdown::paged_table(
notes_merged %>% group_by(current_status) %>%
summarise(mean(ratings), median(ratings),
quantile(ratings,probs = 0.1),
quantile(ratings,probs = 0.9)
)
)
In terms of what the note says about the tweet, most notes say that the tweet is “MISINFORMED_OR_POTENTIALLY_MISLEADING”. This is especially marked for the notes rated as “HELPFUL”, where virtually all the notes say the tweet was misinformed or potentially misleading.
rmarkdown::paged_table(
as.data.frame(
table(notes_merged$classification, notes_merged$current_status)) %>%
pivot_wider(names_from = Var1,values_from = Freq) %>%
rename("Current State" = Var2)
)
Finally, the number of ratings is very similar between the classification of notes.
rmarkdown::paged_table(
notes_merged %>% group_by(classification) %>%
summarise(mean(ratings), median(ratings),
quantile(ratings,probs = 0.1),
quantile(ratings,probs = 0.9)
)
)
rmarkdown::paged_table(
notes_merged
)
Now that I have the final data set notes_merged, I can
start the analysis. We have to import some packages including
tidymodels, yardstick and
tune.
# Load Packages
library(pacman)
p_load(tidyverse,
tidymodels,
recipes,
kknn,
yardstick,
tune,
ggplot2,
ggthemes,
rsample,
parsnip,
workflows
)
Now, let’s split the original data set to make the analysis. I
decided to use \(75\%\) of the data for
training, and the sampling is stratified at the outcome variable
ratings.
# To reproduce the results
set.seed(1984)
# Percentage used for the training set
training_percentage <- 0.75
# Splitting the data
split_notes <- initial_split(notes_merged,
prop = training_percentage,
strata = ratings)
train_notes <- training(split_notes)
test_notes <- testing(split_notes)
The proportion of observations in the training set was 0.7499954, and 0.2500046 for the test set. These numbers are close to the stated proportions.
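Those proportions are just row counts over the total. A self-contained base-R sketch of the same check (stand-in index vectors, not the actual split objects):

```r
set.seed(1984)
n <- 1000
# draw 75% of the row indices for training; the rest form the test set
train_idx <- sample(n, size = floor(0.75 * n))
test_idx  <- setdiff(seq_len(n), train_idx)
length(train_idx) / n  # -> 0.75
length(test_idx) / n   # -> 0.25
```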
In the context of Community Notes within machine learning, envisioning the data set as a collection of notes, the process of dividing this data into training, testing, and validation sets becomes analogous to strategizing how to understand and predict the behavior of future notes. The existing notes serve as a sample, providing insights into how contributors have added context to posts in the past. However, it’s imperative not to assume that the future usage of notes will precisely mirror historical patterns.
Much like a training set, a substantial portion of existing notes would be allocated to allow the model to learn patterns, relationships, and features inherent in the data. This phase involves understanding how contributors have historically interacted with posts, detecting common themes, and learning the dynamics of note creation. The testing set, representative of notes yet unseen by the model, acts as a simulated evaluation of the model’s ability to generalize its learning to new instances of notes. This evaluation is critical in anticipating how well the model would adapt to future notes scenarios.
To account for the unpredictability and potential evolution in how contributors may use CN in the future, a validation set becomes paramount. This set serves as a means of fine-tuning the model, preventing it from overfitting the historical data and ensuring that it doesn’t make assumptions based solely on past patterns. The aim is to create a model that is not only proficient in understanding the existing CN but is also equipped to adapt to unforeseen events and new patterns that may emerge in future note creation.
In summary, the process involves training the model on existing notes, testing its ability to generalize to new notes, and fine-tuning its understanding to ensure adaptability to potential shifts in contributor behavior. This approach is crucial for developing a machine learning model that can robustly predict and comprehend the dynamics of CN, both as they are today and in their future iterations.

## Model Building
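The validation idea discussed above is usually implemented with k-fold cross-validation; in tidymodels this would be a call like rsample::vfold_cv(train_notes, v = 5, strata = ratings). The fold assignment itself is simple enough to sketch in base R (toy sizes, not the project data):

```r
set.seed(1984)
k <- 5
n <- 20
# randomly assign each observation to one of k equally sized folds
folds <- sample(rep(seq_len(k), length.out = n))
table(folds)  # each fold holds n / k = 4 observations
```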
# recipe ----
rec_reg <- recipe(ratings ~ .,
                  # drop the identifiers; summary was already removed
                  # when building notes_final
                  data = train_notes %>% select(-c(note_id, tweet_id))) %>%
  step_normalize(agreement_rate)

# models ----
## linear model ----
lm_mod <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

## KNN model ----
k <- 7
knn_mod <- nearest_neighbor(neighbors = k) %>%
  set_mode("regression") %>%
  set_engine("kknn")

# Workflow ----
## linear ----
lm_wkflow <- workflow() %>%
  # add model
  add_model(lm_mod) %>%
  # add recipe
  add_recipe(rec_reg)

## KNN ----
knn_wkflow <- workflow() %>%
  # add model
  add_model(knn_mod) %>%
  # add recipe
  add_recipe(rec_reg)
# fitting models ----
## linear ----
# fit_lm <-
#   lm_wkflow %>%
#   fit(data = train_notes)
# fit_lm

## KNN ----
# fit_knn <-
#   knn_wkflow %>%
#   fit(data = train_notes)
# fit_knn

# metrics
# notes_metrics <- metric_set(rmse, rsq, mae)

# linear model
# notes_lm_aug <- augment(fit_lm, test_notes)
# notes_metrics(notes_lm_aug, truth = ratings,
#               estimate = .pred)

# notes_merged %>% ggplot() +
#   geom_point(aes(x = agreement_rate, y = ratings))

# knn model
# notes_knn_aug <- augment(fit_knn, test_notes)
# notes_metrics(notes_knn_aug, truth = ratings,
#               estimate = .pred)
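The commented-out metric_set(rmse, rsq, mae) above reduces to simple formulas; here they are computed by hand on toy truth/prediction vectors (toy numbers, not model output):

```r
truth <- c(0, 4, 15, 42)
pred  <- c(1, 3, 14, 40)
rmse_val <- sqrt(mean((truth - pred)^2))  # root mean squared error
mae_val  <- mean(abs(truth - pred))       # mean absolute error
mae_val  # -> 1.25
```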
By Community Notes - https://twitter.com/CommunityNotes/photo, Public Domain, https://commons.wikimedia.org/w/index.php?curid=141534850↩︎
https://www.lemonde.fr/en/pixels/article/2023/07/03/i-spent-one-week-as-an-arbiter-of-truth-on-twitter-s-community-notes-service_6042188_13.html↩︎
https://mashable.com/article/twitter-x-community-notes-misinformation-views-investigation↩︎
By Community Notes - https://github.com/twitter/communitynotes↩︎
By Twitter - Original publication: Screenshot from CommunityNotesContributor. Immediate source: https://twitter.com/i/communitynotes, Fair use, https://en.wikipedia.org/w/index.php?curid=75348629↩︎